14 research outputs found

    The fast multipole method at exascale

    This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for the FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.
To demonstrate the scientific significance of the FMM, we present two applications: direct simulation of blood, which is a multi-scale, multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which the FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step toward solving biologically challenging problems that have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty of producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms have been redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints.
The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match, and in some cases even exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing the FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this to distributed-memory machines, specifically by implementing the FMM in the new distributed CnC, distCnC, to express fine-grained parallelism that would require significant effort in alternative models. Ph.D.

    adPerf: Characterizing the Performance of Third-party Ads

    Monetizing websites and web apps through online advertising is widespread in the web ecosystem. The online advertising ecosystem nowadays forces publishers to integrate ads from third-party domains. On the one hand, this raises several privacy and security concerns that have been actively studied in recent years. On the other hand, given the ability of today's browsers to load dynamic web pages with complex animations and JavaScript, online advertising has also transformed and can have a significant impact on webpage performance. The performance cost of online ads is critical since it eventually impacts user satisfaction as well as users' Internet bills and device energy consumption. In this paper, we perform an in-depth and first-of-a-kind performance evaluation of web ads. Unlike prior efforts that rely primarily on adblockers, we perform a fine-grained analysis of the web browser's page loading process to demystify the performance cost of web ads. We aim to characterize the cost of every component of an ad, so that the publisher, ad syndicate, and advertiser can improve the ad's performance with detailed guidance. For this purpose, we develop an infrastructure, adPerf, for the Chrome browser that classifies page loading workloads into ad-related and main-content at the granularity of browser activities (such as JavaScript and Layout). Our evaluations show that online advertising accounts for more than 15% of the browser page loading workload, and approximately 88% of that is spent on JavaScript. We also track the sources and delivery chain of web ads and analyze performance considering the origin of the ad contents. We observe that two of the well-known third-party ad domains contribute 35% of the ads' performance cost and, surprisingly, top news websites implicitly include unknown third-party ads, which in some cases build up to more than 37% of the ads' performance cost.

    CFDNet: a deep learning-based accelerator for fluid simulations

    Computational fluid dynamics (CFD) is widely used in physical system design and optimization, where it is used to predict engineering quantities of interest, such as the lift on a plane wing or the drag on a motor vehicle. However, many systems of interest are prohibitively expensive for design optimization, due to the expense of evaluating CFD simulations. To render the computation tractable, reduced-order or surrogate models are used to accelerate simulations while respecting the convergence constraints provided by the higher-fidelity solution. This paper introduces CFDNet, a coupled physical simulation and deep learning framework for accelerating the convergence of Reynolds-Averaged Navier-Stokes simulations. CFDNet is designed to predict the primary physical properties of the fluid, including velocity, pressure, and eddy viscosity, using a single convolutional neural network at its core. We evaluate CFDNet on a variety of use cases, both extrapolative and interpolative, where test geometries are observed or not observed during training. Our results show that CFDNet meets the convergence constraints of the domain-specific physics solver while outperforming it by 1.9-7.4x on both steady laminar and turbulent flows. Moreover, we demonstrate the generalization capacity of CFDNet by testing its prediction on new geometries unseen during training. In this case, the approach meets the CFD convergence criterion while still providing significant speedups over traditional domain-only models. Comment: Accepted for publication at the International Conference on Supercomputing (ICS) 202

    Review: Artificial Intelligence for Liquid-Vapor Phase-Change Heat Transfer

    Artificial intelligence (AI) is shifting the paradigm of two-phase heat transfer research. Recent innovations in AI and machine learning uniquely offer the potential for collecting new types of physically meaningful features that have not been addressed in the past, for making their insights available to other domains, and for solving for physical quantities based on first principles for phase-change thermofluidic systems. This review outlines core ideas of current AI technologies connected to thermal energy science to illustrate how they can be used to push the boundaries of our knowledge about boiling and condensation phenomena. AI technologies for meta-analysis, data extraction, and data stream analysis are described with their potential challenges, opportunities, and alternative approaches. Finally, we offer outlooks and perspectives regarding physics-centered machine learning, sustainable cyberinfrastructures, and multidisciplinary efforts that will help foster the growing trend of AI for phase-change heat and mass transfer.

    BubbleML: A Multi-Physics Dataset and Benchmarks for Machine Learning

    In the field of phase change phenomena, the lack of accessible and diverse datasets suitable for machine learning (ML) training poses a significant challenge. Existing experimental datasets are often restricted, with limited availability and sparse ground truth data, impeding our understanding of these complex multiphysics phenomena. To bridge this gap, we present the BubbleML dataset (https://github.com/HPCForge/BubbleML), which leverages physics-driven simulations to provide accurate ground truth information for various boiling scenarios, encompassing nucleate pool boiling, flow boiling, and sub-cooled boiling. This extensive dataset covers a wide range of parameters, including varying gravity conditions, flow rates, sub-cooling levels, and wall superheat, comprising 79 simulations. BubbleML is validated against experimental observations and trends, establishing it as an invaluable resource for ML research. Furthermore, we showcase its potential to facilitate exploration of diverse downstream tasks by introducing two benchmarks: (a) optical flow analysis to capture bubble dynamics, and (b) operator networks for learning temperature dynamics. The BubbleML dataset and its benchmarks serve as a catalyst for advancements in ML-driven research on multiphysics phase change phenomena, enabling the development and comparison of state-of-the-art techniques and models. Comment: Submitted to NeurIPS Datasets and Benchmarks Track 202

    Towards Portable Online Prediction of Network Utilization using MPI-level Monitoring

    Stealing network bandwidth helps a variety of HPC runtimes and services to run additional operations in the background without negatively affecting the applications. A key ingredient to make this possible is an accurate prediction of future network utilization, enabling the runtime to plan the background operations in advance so as to avoid competing with the application for network bandwidth. In this paper, we propose a portable deep learning predictor that uses only the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization. We leverage the fact that most HPC applications exhibit periodic behaviors to enable predictions far into the future (at least the length of a period). Our online approach does not have an initial training phase; it continuously improves itself during application execution without incurring significant computational overhead. Experimental results show better accuracy and lower computational overhead compared with the state of the art on two representative applications.

    Scalable Communication Endpoints for MPI+Threads Applications

    Hybrid MPI+threads programming is gaining prominence as an alternative to the traditional "MPI everywhere" model to better handle the disproportionate increase in the number of cores compared with other on-node resources. Current implementations of these two models represent the two extreme cases of communication resource sharing in modern MPI implementations. In the MPI-everywhere model, each MPI process has a dedicated set of communication resources (also known as endpoints), which is ideal for performance but wastes resources. With MPI+threads, current MPI implementations share a single communication endpoint among all threads, which is ideal for resource usage but hurts performance. In this paper, we explore the tradeoff space between performance and communication resource usage in MPI+threads environments. We first demonstrate the two extreme cases, one where all threads share a single communication endpoint and another where each thread gets its own dedicated communication endpoint (similar to the MPI-everywhere model), and showcase the inefficiencies in both cases. Next, we perform a thorough analysis of the different levels of resource sharing in the context of Mellanox InfiniBand. Using the lessons learned from this analysis, we design an improved resource-sharing model to produce scalable communication endpoints that can achieve the same performance as dedicated communication resources per thread while using just a third of the resources.